Preparing Clean Views of Data for Data Mining

Authors

  • Paul Jermyn
  • Maurice Dixon
  • Brian J Read
Abstract

Good data preparation is a key prerequisite to successful data mining [P99]. Conventional wisdom suggests that data preparation takes about 60 to 80% of the time involved in a data mining exercise [R97]. There have been good reviews of the problems associated with data preparation [F97, HS98, MS97]. However, the data cleaning aspect of data preparation is regarded as requiring major human input and has often been neglected in practice. This paper reports work undertaken in support of a data mining programme at Rutherford Appleton Laboratory (RAL). It proposes a Clean Views Model for data cleaning. The action taken to clean data depends on the business purpose of the analysis. It also depends on what is believed about the data, as expressed in the business rules that the data need to satisfy. These rules are encoded in a validator function. Beliefs about the data can change during the data mining process, and business purposes also change over time. Data cleaning should therefore be treated as an iterative exercise rather than a one-off procedure. We contend that the original, unclean data should be retained and that cleaning produces a particular view, a Clean View, for a particular validator function. We developed the Clean Views approach after examining several datasets. We report some work on flow-of-funds data associated with transitional economies [Ba99]. These data were generated by the central banks of Eastern European countries and are used to assess those countries' movement towards a market economy. We also report work on data from an international mailing list used for a survey of business school alumni who have graduated as Masters of Business Administration.
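The Clean Views idea described in the abstract can be illustrated with a minimal sketch. It assumes that a validator function is a predicate over a single record and that a Clean View simply keeps the records that pass; the abstract does not say whether cleaning filters or repairs records, so filtering is an assumption made here. The record fields (`country`, `year`, `amount`) and the names `clean_view`, `has_amount`, and `strict` are hypothetical and do not come from the paper.

```python
from typing import Any, Callable, List, Mapping, Sequence

Record = Mapping[str, Any]
Validator = Callable[[Record], bool]


def clean_view(raw: Sequence[Record], validator: Validator) -> List[Record]:
    """Return the records that satisfy `validator`; `raw` is never modified."""
    return [record for record in raw if validator(record)]


# Hypothetical raw records; the field names are illustrative only.
RAW = (
    {"country": "PL", "year": 1995, "amount": 120.0},
    {"country": "PL", "year": 1996, "amount": None},
    {"country": "", "year": 1996, "amount": 75.5},
    {"country": "HU", "year": 1996, "amount": -10.0},
)


def has_amount(record: Record) -> bool:
    """Business rule for one purpose: the amount must be present."""
    return record["amount"] is not None


def strict(record: Record) -> bool:
    """Stricter rule for another purpose: known country, non-negative amount."""
    return bool(record["country"]) and has_amount(record) and record["amount"] >= 0


# Two views over the same retained raw data, one per validator function.
view_a = clean_view(RAW, has_amount)
view_b = clean_view(RAW, strict)
```

Because each view is derived on demand from the retained raw data, changing the business rules only means writing a new validator and recomputing the view, which fits the iterative character of cleaning that the abstract argues for.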

Similar resources

a swift heuristic algorithm base on data mining approach for the Periodic Vehicle Routing Problem: data mining approach

The periodic vehicle routing problem focuses on establishing a plan of visits to clients over a given time horizon so as to satisfy some service level while optimizing the routes used in each time period. This paper presents a new, effective heuristic algorithm based on data mining tools for the periodic vehicle routing problem (PVRP). The results of the proposed algorithm are compared with the resu...

High performance of the support vector machine in classifying hyperspectral data using a limited dataset

To prospect mineral deposits at regional scale, recognition and classification of hydrothermal alteration zones using remote sensing data is a popular strategy. Due to the large number of spectral bands, classification of the hyperspectral data may be negatively affected by the Hughes phenomenon. A practical way to handle the Hughes problem is preparing a lot of training samples until the size ...

Perform Three Data Mining Tasks with Crowdsourcing Process

In data mining studies, performing feature selection by hand is complex, so some of the labeling needs to be sent to workers through crowdsourcing. The process of outsourcing data mining tasks to users is often handled by software systems without enough knowledge of the users' age or place of residence. Uncertainty about the performance of virtual user...

Data Preparation for Data Mining

Practical experience of data mining has revealed that preparing data is the most time-consuming phase of any data mining project. Estimates of the amount of time and resources spent on data preparation vary from at least 60% to upward of 80% (SPSS, 2002a). In spite of this fact, not enough attention is given to this important task, thus perpetuating the idea that the core of the data mining eff...

A field investigation of application of digital terrestrial photogrammetry to characterize geometric properties of discontinuities in open-pit slopes

To analyze slope stability in open-pit mines, structural parameters of the rock mass, such as the persistence and spatial orientation of discontinuities, are characterized through field surveys, which are costly and time-consuming and carry high risks of rock toppling and rock fall. In the present work, a new application of terrestrial digital photogrammetry is introduced...

Publication date: 2000